The objective of this notebook is to analyze the results from the first follow-up round of the Rwanda long-term soil health study.
See section with Notes for Nathaniel
See section with Notes for Patrick and Step
Paired yield and soil IDs are a mess: we lose almost 500 observations to irreconcilable duplicates or IDs that simply don’t have a match.
TODO - check the projection of the baseline maps; are they shifted over?
TODO - figure out how to connect photos to farmers for the enumerators
I’m going to load the baseline data from the baseline analysis. The report and data can be found here. I’ll load the new data directly from CommCare. The original baseline data object was d but I’m going to rename it b. Each subsequent round will be r1, r2, and so on.
Overall I want to bring in three data sources: the baseline data, the round 1 follow-up data from CommCare, and the soil lab predictions (plus the identifiers that link them to the surveys).
dataDir <- normalizePath(file.path("..", "..", "data"))
forceUpdateAll <- FALSE
baselineDir <- normalizePath(file.path("..", "rw_baseline", "data"))
load(file=paste0(baselineDir, "/shs rw baseline full soil.Rdata")) # obj d
b <- baseVars
Context point: The baseline data has 2439 rows, 9 fewer than we expected at baseline because some farmers were not surveyed as planned. See the baseline report for more details. Also, these baseline values have te
Alex Villec wrote a cleaning script that deals with the first round of Rwanda SHS follow-up data and makes key adjustments to it. To use that do file here, I’m going to download the data from CommCare, save it, and have the do file run against that file. However, the original file Alex was using had different variable names than the file pulled by the API. The options from here are to just go with Alex’s file or to align the variable names between his version and the CommCare version. It’s valuable to have the data directly from CommCare, but it’ll involve more work up front.
source("../oaflib/commcareExport.R")
r <- getFormData("oafrwanda", "M&E", "16B Ubutaka (Soil)", forceUpdate = forceUpdateAll)
[1] "found fdd434a62c6512b320a4cb8c4fb872a"
write.csv(r, file="rawCcR1Data.csv", row.names = F)
The first round of data from CommCare has 2380 observations. This leaves XX farmers unsurveyed in the first survey round. See this cleaning file for more information on the farmers we did not find again in the first follow-up.
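As a sketch of how the unsurveyed farmers could be identified (the toy ID vectors below are invented; in the notebook the keys would come from b and r), a set difference between the baseline and round 1 IDs does the job:

```r
# Hypothetical sketch: baseline IDs absent from round 1 are the
# unsurveyed farmers.
baseIds <- c("12", "137", "204", "2278")
r1Ids <- c("12", "137", "2278")
unsurveyed <- setdiff(baseIds, r1Ids)
unsurveyed  # "204"
```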
Here I’m going to call the Stata cleaning file to make AV’s changes to the R1 follow-up data. This requires that the data from CommCare have the same variable names as those in the Stata cleaning file. I’m going to try to execute that here:
stataDir <- normalizePath(file.path("..", "rw_round_1_check"))
Here I access the soil predictions from the OAF soil lab. Patrick Bell manages the lab and Mike Barber oversees the prediction scripts.
soilDir <- normalizePath(file.path("..", "..", "OAF Soil Lab Folder", "Projects", "rw_shs_second_round", "4_predicted", "other_summaries"))
soil <- read.csv(file=paste(soilDir, "combined-predictions-including-bad-ones.csv", sep = "/"))
idDir <- normalizePath(file.path("..", "..", "OAF Soil Lab Folder", "Projects", "rw_shs_second_round", "5_merged"))
Identifiers <- read_excel(paste(idDir,"database.xlsx",sep="/"), sheet=1)
Combine the available data by farmer and resolve merging issues. These data can be combined long (as long as the variable names are consistent) or wide. I’m going to combine the data long and use split-type commands to aggregate the data more easily. Confirm the variable names are consistent. By advancing this code on 5/9/17, I’m for the time being ignoring the cleaning Alex did in his do file. I’ll need to go back and incorporate those changes.
TODO: see if the variable names in Alex’s raw data, shared by Nathaniel, match the data I’m downloading from CommCare. If so, don’t use the var_names.xlsx sheet and instead use those variable names and Alex’s do file to preserve all of his changes.
Not many of the names are the same. I’ve downloaded the metadata from CommCare, which I’ll use to simplify the cleaning of the round 1 data. I’m also going to reshape the baseline variable names to simplify matching baseline variables to round 1 variables.
datNames <- function(dat){
  # list each variable name alongside its first three values as examples
  varNames = names(dat)
  exVal = do.call(rbind, lapply(varNames, function(x){
    val = dat[1:3, x]
    return(val)
  }))
  out = cbind(varNames, exVal)
  return(out)
}
baseNames <- datNames(b)
write.csv(baseNames, file="baseline var names.csv", row.names = F)
Load Alex’s raw data and take the variable names from it. If I can align these variable names with the data from CC, I can then execute Alex’s cleaning script on the CC data and proceed with combining the data.
rawDir <- normalizePath(file.path("Soil health study (year one)", "data"))
avRaw <- read.csv(paste(rawDir, "y1_shs_rwanda_28sep.csv", sep = "/"), stringsAsFactors = F)
It looks like the data from CommCare aligns with the raw data Alex worked with starting at info_formid, which is the second column of avRaw and the 10th column of r. Let’s just try transferring the names over; then the work of updating the variable names through the CC codebook export may not be necessary!
varTest <- data.frame(fromcc = names(r)[10:409], fromav = names(avRaw)[2:401])
# head(varTest)
# tail(varTest)
#varTest[90:120,]
write.csv(varTest, file="variableNameCheck.csv")
It seems to line up okay (with some adjustments)! To incorporate Alex’s cleaning code I have to export the data from R to a form Stata can accept, run the code, and then load the data back in.
This function removes the strange outputs in the data from CommCare so that the Stata code works:
charClean <- function(df){
  df <- as.data.frame(lapply(df, function(x){
    x = gsub("'", '', x)                       # strip stray quote marks
    x = gsub("^b", '', x)                      # strip the leading "b" left by byte strings
    x = ifelse(grepl("map object", x), NA, x)  # unparsed map objects become NA
    return(x)
  }))
  return(df)
}
r <- charClean(r)
Here is where I actually update the names in r to match Alex’s original data.
names(r)[10:409] <- names(avRaw)[2:401]
# export so Stata can run it - check for variable names longer than 32 characters
table(nchar(names(r)))
2 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 36 37 38 39 40
1 4 3 1 1 2 6 1 1 2 3 5 17 11 16 12 5 8 1 7 1 3 9 9 3 7 2 3 1 28 16 47 32 11
41 42 43 44 45 46 47 48 49 51 52
7 27 18 21 31 10 7 4 3 1 1
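Stata caps variable names at 32 characters, so the longer names counted above would be a problem on import. A minimal sketch of a pre-export fix (the long names here are invented for illustration):

```r
# Truncate to 28 characters, leaving room for the suffix that
# make.unique() appends when two truncated names collide.
longNames <- c("which_maize_seed_planted_in_season_16a_1",
               "which_maize_seed_planted_in_season_16a_2",
               "age")
shortNames <- make.unique(substr(longNames, 1, 28), sep = "_")
all(nchar(shortNames) <= 32)  # TRUE
```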
write.csv(r, file="toBeCleanedStata.csv", row.names = F)
stata("cleans_y1_shs_rwanda.do", stata.echo=F)
Now load the result of the Stata file
r <- read.csv("cleanedforR.csv", stringsAsFactors = F)
The r dataframe has many more variables than the baseline survey. This was in part expected; we added questions to the first follow-up round based on lessons from the baseline. It’s also due to how the survey was set up in CommCare. Before combining the baseline and the first follow-up round I need to align the variable names and reshape the seasonal variables.
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
library(grid)
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
toDrop <- c("appformid", "id", "domain", "metadatadeviceid")
r <- r[,!names(r) %in% toDrop]
source("../oaflib/misc.R")
names(r) <- gsub("^y1_|intro_", "", names(r))
r[r=="."] <- NA
r <- divideGps(r, "gps_coord")
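divideGps() comes from oaflib; as a sketch of what it presumably does (the four-part "lat lon alt precision" format is inferred from the gps_coord values shown later in this notebook, and the toy version below does not handle the NAs the real data contain):

```r
# Assumed sketch of the divideGps() idea: split a space-separated
# "lat lon alt precision" string into four numeric columns.
splitGps <- function(df, col) {
  parts <- strsplit(as.character(df[[col]]), " ")
  mat <- matrix(as.numeric(unlist(parts)), ncol = 4, byrow = TRUE)
  out <- as.data.frame(mat)
  names(out) <- c("lat", "lon", "alt", "precision")
  cbind(df, out)
}
ex <- data.frame(gps_coord = "-1.5578864 30.3943679 1525.93 15.0",
                 stringsAsFactors = FALSE)
res <- splitGps(ex, "gps_coord")
res$lat        # -1.5578864
res$precision  # 15
```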
The responses to the categorical variables should be regulated through CommCare. To check, however, I’ll make a table showing the top ten responses in descending order and a graph of response counts so I know what to inspect. I’ll then capture any characters that should be numeric and convert them.
catVars <- names(r)[sapply(r, function(x){
is.character(x)
})]
enumClean <- function(dat, x, toRemove){
dat[,x] <- ifelse(dat[,x] %in% toRemove, NA, dat[,x])
return(dat[,x])
}
strTable <- function(dat, x){
varName = x
tab = as.data.frame(table(dat[,x], useNA = 'ifany'))
tab = tab[order(tab$Freq, decreasing = T),]
end = ifelse(length(tab$Var1)<10, length(tab$Var1), 10)
repOrder = paste(tab$Var1[1:end], collapse=", ")
out = data.frame(variable = varName,
responses = repOrder)
return(out)
}
# clean up known values
catEnumVals <- c("-99", "-88", "- 99", "-99.0", "88", "_88", "- 88", "0.88",
"--88", "__88", "-88.0", "99.0")
r[,catVars] <- sapply(catVars, function(y){
r[,y] <- enumClean(r,y, catEnumVals)
})
responseTable <- do.call(rbind, lapply(catVars, function(x){
strTable(r, x)
}))
A simple table to preview the values in the data, ranked by frequency.
kable(responseTable)
| variable | responses |
|---|---|
| metadatauserid | c3e5e4d69726a6587d9d5739f3961b03, ab7675956342e27f3a134b45731ca6f9, a8f48eb2ccc435935cdefec31a49f512, 2da910f9aa814b352b62821db7ac30fc, 7e1b7bc7a7147b9f4ddfedab54e8e470, 43ab9369b7e43edaa7d9614594f4d1dd, 9938a37f596038d85181e4d38cff2433, bfb7f31368600aefe2c4386ad49c5126, 4a69416450e53b6e762ea707aaf80104, 089ae26df7d5ea3886dbbe3709c34013 |
| metadatausername | umushakashatsi, umushakashatsi3, umushakashatsi72, umushakashatsi42, umushakashatsi58, umushakashatsi14, umushakashatsi66, umushakashatsi7, umushakashatsi13, umushakashatsi73 |
| metadatatimestart | 2012-01-01T02:07:31.468000, 2012-01-01T21:53:26.687000, 2012-01-01T23:04:56.746000, 2012-01-06T20:14:52.707000, 2012-01-06T21:14:58.517000, 2012-01-07T01:08:44.167000, 2016-07-27T07:53:43.734000, 2016-07-27T08:39:53.902000, 2016-07-27T08:39:57.777000, 2016-07-27T08:41:57.353000 |
| metadatatimeend | 2012-01-06T20:52:59.887000, 2012-01-07T19:01:49.301000, 2012-01-07T19:04:31.323000, 2012-01-07T19:09:38.384000, 2016-07-27T09:41:47.415000, 2016-07-27T09:57:48.152000, 2016-07-27T10:43:47.085000, 2016-07-27T11:24:53.338000, 2016-07-27T11:25:03.144000, 2016-07-27T11:26:55.594000 |
| start_time | 09:00:00.000+02, 08:30:00.000+02, 09:40:00.000+02, 10:13:00.000+02, 10:36:00.000+02, 12:20:00.000+02, 09:14:00.000+02, 09:29:00.000+02, 10:14:00.000+02, 10:56:00.000+02 |
| date | 2016-08-10, 2016-08-11, 2016-08-08, 2016-08-17, 2016-08-03, 2016-08-18, 2016-08-22, 2016-08-19, 2016-08-04, 2016-08-12 |
| enum_name | Hagenimana bienvenue, MUCYOWIMIHIGO J MV, Nyandwi Anathalie, ZIMUKWIYE Dominique, Nyirangirimana jeanne, Torero pacifique, Utamuriza Jeanne, Niyidufasha nathanael, Rukundo japhet, NYIRAMPANO Bernadette |
| photo | , 1325376816129.jpg, 1325447804135.jpg, 1325452024080.jpg, 1325873951716.jpg, 1325877535600.jpg, 1325891580194.jpg, 1469601919598.jpg, 1469601990645.jpg, 1469602247216.jpg |
| district | Rutsiro, Karongi, Mugonero, Nyamasheke, Huye, Rwamagana, Gatsibo_NLWH, Gatsibo_LWH, Nyamagabe, Kayonza |
| cell_field | Rubumba, Mubuga, Nyabicwamba, NYAGATARE, Mugera, MutongoCA, Bihumbe, Busetsa, Gihumuza, Kibyagira A |
| village | Gasharu, Murambi, Rugarama, Kabeza, Karambo, Kigarama, Nyabugogo, Kabuga, Kivumu, Gasagara |
| farmer_list | Havugimana celestin, Karekezi Celestin, Mukabinyange cecile, Mukafundi Marie, Musabyimana Jean, Ndananiwe Francois, Ndayambaje Emmanuel, Nsengiyumva Augustin, Nyirahabimana seraphine, Nyiraminani Constasie |
| farmer_respond | NA, Akimana Jeannette, BIMENYANDE Djumapri, Habimana Emmanuel, Hagumagatsi Gaspard, Karekezi Celestin, Mukabinyange cecile, Mukangiriye Donatha, Mukankusi Beatrice, MUNYENSANGA Emmanuel |
| farmer_phonenumber | NA, Ntayo, 0, ntayo, Nta telephone afite, Ntayo afite, 0.0, -, nta telephone afite, Ntayo bafite |
| d_phone | NA, 0, Ntayo, ntayo, Ni wewabajijwe, -, Ntayo afite, O, Nta telephone afite, Ntayo bafite |
| neighbor_phonenumber | NA, ntayo, 0, Ntayo, 0.0, -, 0789699430, 0785275883, 7.85275883E8, 0723071668 |
| gender | female, male |
| n_tubura_season | not_a_client_3seasons, 16a 16b 17a, 16a 17a, 17a, 16a 16b, 16a, NA, 16b 17a, 16b, 16a not_a_client_3seasons |
| which_crop_16a_1 | gor |
| which_maize_seed_16a_1 | NA, gor_nsp, new_hybrid, OPV_saved, Hybride_saved, OPV_new |
| which_crop_16a_2 | NA, yum, gor, big, insina, jum, soya, ray, shy, shaz |
| which_maize_seed_16a_2 | NA, gor_nsp, Hybride, OPV_saved, OPV_new, Hybride_saved |
| fert_type1_16a | None, DAP, NA, NPK-17, urea, NPK-22, npk2555 |
| fert_type2_16a | NA, urea, None, DAP, NPK-17, NPK-22, npk2555 |
| quality_compost_16a | Good, NA, Average, Bad |
| type_compost_16a | cow, NA, goat, pig, other, plant, kitchen_waste, human, chicken |
| d_lime_16a | no_lime, NA, lime_outside, lime_tubura, both_tubura_non_tubura |
| which_crop_16b_1 | big, shy, saka, NA, jum, soya, gor, ray, nyo, yum |
| which_maize_seed_16b_1 | NA, new_hybrid, gor_nsp, OPV_new, Hybride_saved, OPV_saved |
| which_crop_16b_2 | NA, gor, yum, jum, insina, big, soya, saka, shy, ray |
| which_maize_seed_16b_2 | NA, new_hybrid, OPV_new, gor_nsp, Hybride_saved, OPV_saved |
| fert_type1_16b | None, NA, DAP, NPK-17, urea, NPK-22, npk2555 |
| fert_type2_16b | NA, None, urea, DAP, NPK-17 |
| quality_compost_16b | NA, Good, Average, Bad |
| type_compost_16b | NA, cow, pig, goat, kitchen_waste, plant, human, other, chicken |
| d_lime_16b | no_lime, NA, lime_outside, lime_tubura |
| how_use_residues | feed_animals, mulching, leave_field, compost_use, burn_field, burn_discard, sell |
| field_texture | clay_loam, loam, silty_clay_loam, sandy_clay_loam, sandy_loam, silty_loam, silty_clay, loamy_sand, sand, clay |
| field_erosion | drainageditch, nothing, radicalterrace, gradualterrace |
| crop_direction | not_applicable, NA, across_slope, down_slope |
| comments | , Ntakibazo, ntakibazo, ntayo, Ntayo, Ntazo, ntazo, Ntakibazo., Ntacyahindutse, NA |
| sample_id | 12, 137, 1503, 2044C, 2278, 2299, 2610, 2612, 2612C, 10 |
| kg_yield_hwag_16b_1 | NA |
| kg_seed_ananas_16b_2 | NA |
| kg_seed_veg_16a_1 | NA |
| kg_seed_16a_1 | N, 1, 0, 2, -, 3, 4, 5, 6, 8 |
| kg_seed_16a_2 | , NA, 0.5, 1.0, 0.25, 2.0, 3.0, 1.5, 4.0, 5.0 |
| kg_seed_16b_1 | NA, , 3.0, 2.0, 1.0, 0.5, 1.5, 4.0, 5.0, 6.0 |
| kg_seed_16b_2 | , NA, 0.5, 1.0, 0.25, 2.0, 1.5, 3.0, 4.0, 5.0 |
| kg_yield_16a_1 | NA, 50.0, 20.0, 100.0, 30.0, 10.0, 40.0, 15.0, 200.0, 5.0 |
| kg_yield_16a_2 | , NA, 20.0, 10.0, 50.0, 30.0, 0.0, 15.0, 5.0, 100.0 |
| kg_yield_16b_1 | , NA, 20.0, 30.0, 10.0, 15.0, 5.0, 50.0, 40.0, 100.0 |
| kg_yield_16b_2 | , NA, 0.0, 10.0, 5.0, 20.0, 15.0, 3.0, 40.0, 50.0 |
| gps_coord | NA, , -1.5578864555610237 30.39436791689242 1525.93 15.0, -1.5631940702424174 30.227211802604916 1659.67 15.0, -1.5639320092237632 30.227385933820276 1434.79 10.0, -1.5667398240763533 30.273551799148027 979.26 10.0, -1.567033053159622 30.277914044142907 982.39 10.0, -1.5671285398447943 30.275353919885177 560.94 10.0, -1.5685424850437755 30.248542080122405 1468.14 20.0, -1.5688621725334673 30.24841864727349 851.74 10.0 |
| unique_location | Gatsibo_NLWH2610, Gatsibo_NLWH2612, Gatsibo_NLWH2612C, Huye137, Karongi1503, Rutsiro2044C, Rutsiro2278, Rutsiro2299, Gatsibo_LWH2476, Gatsibo_LWH2476C |
repGraphs <- function(dat, x){
tab = as.data.frame(table(dat[,x], useNA = 'ifany'))
tab = tab[order(tab$Freq, decreasing = T),]
print(
ggplot(data=tab, aes(x=Var1, y=Freq)) + geom_bar(stat="identity") +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title =paste0("Composition of variable: ", x))
)
}
adminVars <- c(names(r)[grep("meta", names(r))], "start_time", "enum_name", "photo", "cell_field", "village", "farmer_respond", "farmer_phonenumber", "d_phone", "neighbor_phonenumber", "farmer_list", "unique_location", "comments", "gps_coord", "sample_id", "SSN")
nonAdminVars <- catVars[!catVars %in% adminVars]
for(i in 1:length(nonAdminVars)){
repGraphs(r, nonAdminVars[i])
}
r$female <- ifelse(r$gender=="female", 1, 0)
r$district <- ifelse(grepl("nyanza", r$district)==T, "Nyanza", r$district)
#table(r$kg_seed_16b_1)
#table(r$kg_yield_16a_2)
strtoNum <- c("kg_seed_16b_1", "kg_yield_16a_1", "kg_yield_16b_1", "kg_yield_16b_2")
r[,strtoNum] <- sapply(r[,strtoNum], function(x){as.numeric(x)})
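Before a blanket as.numeric() conversion like the one above, it can help to list the character values that will silently coerce to NA, so entries like "N" or "-" are handled deliberately rather than lost. A small sketch:

```r
# Return the distinct values of a character vector that as.numeric()
# would turn into NA (ignoring values that are already NA).
badNumeric <- function(x) {
  suppressWarnings(unique(x[!is.na(x) & is.na(as.numeric(x))]))
}
badNumeric(c("1", "2.5", "N", "-", "", "3"))  # "N" "-" ""
```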
Notes on the categorical variables:
numVars <- names(r)[sapply(r, function(x){
is.numeric(x)
})]
Basic cleaning of known issues like enumerator codes for DK, NWR, etc.
enumVals <- c(-88,-85, -99)
r[,numVars] <- sapply(numVars, function(y){
r[,y] <- enumClean(r,y, enumVals)
})
iqr.check <- function(dat, x) {
  # flag values outside Q1 - 1.5*IQR or Q3 + 1.5*IQR, then summarize the
  # variable with and without those outliers
  q1 = summary(dat[,x])[[2]]
  q3 = summary(dat[,x])[[5]]
  iqr = q3-q1
  mark = ifelse(dat[,x] < (q1 - (1.5*iqr)) | dat[,x] > (q3 + (1.5*iqr)), 1,0)
  tab = rbind(
    summary(dat[,x]),
    summary(dat[mark==0, x])
  )
  return(tab)
}
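A toy run of the same IQR rule on invented values shows what the filter keeps (quantile() here plays the role of the summary() quartiles):

```r
# 1000 lies far above Q3 + 1.5*IQR, so the filter drops it.
x <- c(1, 2, 3, 4, 5, 1000)
q <- quantile(x, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
keep <- x >= q[[1]] - 1.5 * iqr & x <= q[[2]] + 1.5 * iqr
x[keep]  # 1 2 3 4 5
```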
# remove admin vars
numAdminVars <- c(numVars[1:3])
numVarsNotAdmin <- numVars[!numVars %in% numAdminVars]
iqrTab <- do.call(plyr::rbind.fill, lapply(numVarsNotAdmin, function(y){
#print(y)
res = iqr.check(r, y)
#print(dim(res))
out = data.frame(var=rbind(y, paste(y, ".iqr", sep="")), res)
return(out)
}))
iqrTab[,2:8] <- sapply(iqrTab[,2:8], function(x){round(x,1)})
The outlier table summarizes the numeric variables with and without IQR outliers to show how the data changes based on this filter.
knitr::kable(iqrTab, row.names = F, digits = 0, format = 'html')
| var | Min. | X1st.Qu. | Median | Mean | X3rd.Qu. | Max. | NA.s |
|---|---|---|---|---|---|---|---|
| d_client_16b | 0 | 0 | 0 | 0 | 1 | 1 | NA |
| d_client_16b.iqr | 0 | 0 | 0 | 0 | 1 | 1 | NA |
| d_client_17a | 0 | 0 | 0 | 0 | 1 | 1 | NA |
| d_client_17a.iqr | 0 | 0 | 0 | 0 | 1 | 1 | NA |
| age | 16 | 35 | 45 | 47 | 57 | 90 | NA |
| age.iqr | 16 | 35 | 45 | 47 | 57 | 90 | NA |
| n_household | 0 | 4 | 5 | 5 | 7 | 39 | NA |
| n_household.iqr | 0 | 4 | 5 | 5 | 7 | 11 | NA |
| n_cows | 0 | 0 | 1 | 1 | 1 | 15 | NA |
| n_cows.iqr | 0 | 0 | 1 | 1 | 1 | 2 | NA |
| n_goats | 0 | 0 | 0 | 1 | 2 | 18 | NA |
| n_goats.iqr | 0 | 0 | 0 | 1 | 2 | 5 | NA |
| n_chickens | 0 | 0 | 0 | 1 | 1 | 40 | NA |
| n_chickens.iqr | 0 | 0 | 0 | 0 | 0 | 2 | NA |
| n_pigs | 0 | 0 | 0 | 0 | 1 | 11 | NA |
| n_pigs.iqr | 0 | 0 | 0 | 0 | 1 | 2 | NA |
| n_sheep | 0 | 0 | 0 | 0 | 0 | 35 | NA |
| n_sheep.iqr | 0 | 0 | 0 | 0 | 0 | 0 | NA |
| field_length | 0 | 13 | 20 | 26 | 32 | 214 | NA |
| field_length.iqr | 0 | 13 | 20 | 23 | 30 | 60 | NA |
| field_width | 0 | 12 | 20 | 24 | 31 | 160 | NA |
| field_width.iqr | 0 | 12 | 20 | 22 | 30 | 59 | NA |
| n_spots | 3 | 3 | 3 | 4 | 5 | 5 | NA |
| n_spots.iqr | 3 | 3 | 3 | 4 | 5 | 5 | NA |
| fert_kg1_16a | 0 | 1 | 2 | 4 | 5 | 80 | 1408 |
| fert_kg1_16a.iqr | 0 | 1 | 2 | 3 | 4 | 11 | 1408 |
| fert_kg2_16a | 0 | 0 | 0 | 2 | 2 | 200 | 1198 |
| fert_kg2_16a.iqr | 0 | 0 | 0 | 1 | 2 | 5 | 1198 |
| d_compost_16a | 0 | 1 | 1 | 1 | 1 | 1 | 271 |
| d_compost_16a.iqr | 1 | 1 | 1 | 1 | 1 | 1 | 271 |
| kg_compost_16a | 0 | 100 | 200 | 268 | 300 | 20000 | 613 |
| kg_compost_16a.iqr | 0 | 100 | 191 | 205 | 300 | 600 | 613 |
| kg_lime_16a | 0 | 15 | 40 | 66 | 100 | 500 | 2345 |
| kg_lime_16a.iqr | 0 | 10 | 25 | 52 | 100 | 150 | 2345 |
| fert_kg1_16b | 0 | 1 | 2 | 4 | 4 | 100 | 1964 |
| fert_kg1_16b.iqr | 0 | 1 | 2 | 2 | 3 | 8 | 1964 |
| fert_kg2_16b | 0 | 0 | 0 | 0 | 0 | 88 | 1656 |
| fert_kg2_16b.iqr | 0 | 0 | 0 | 0 | 0 | 0 | 1656 |
| d_compost_16b | 0 | 0 | 1 | 0 | 1 | 1 | 529 |
| d_compost_16b.iqr | 0 | 0 | 1 | 0 | 1 | 1 | 529 |
| kg_compost_16b | 0 | 100 | 160 | 238 | 300 | 10000 | 1411 |
| kg_compost_16b.iqr | 0 | 100 | 150 | 193 | 250 | 600 | 1411 |
| kg_lime_16b | 1 | 10 | 25 | 59 | 50 | 650 | 2353 |
| kg_lime_16b.iqr | 1 | 10 | 25 | 32 | 50 | 100 | 2353 |
| field_slope | -5 | 3 | 6 | 9 | 14 | 60 | NA |
| field_slope.iqr | -5 | 3 | 6 | 9 | 14 | 30 | NA |
| field_n_crops | 0 | 1 | 1 | 2 | 2 | 30 | 343 |
| field_n_crops.iqr | 0 | 1 | 1 | 1 | 2 | 3 | 343 |
| kg_seed_16b_1 | 0 | 1 | 2 | 5 | 4 | 500 | 754 |
| kg_seed_16b_1.iqr | 0 | 1 | 2 | 3 | 4 | 10 | 754 |
| kg_yield_16a_1 | 0 | 15 | 34 | 73 | 80 | 6000 | 1570 |
| kg_yield_16a_1.iqr | 0 | 12 | 30 | 41 | 50 | 170 | 1570 |
| kg_yield_16b_1 | 0 | 8 | 20 | 53 | 50 | 6000 | 600 |
| kg_yield_16b_1.iqr | 0 | 8 | 20 | 28 | 40 | 112 | 600 |
| kg_yield_16b_2 | 0 | 3 | 10 | 25 | 25 | 600 | 1954 |
| kg_yield_16b_2.iqr | 0 | 3 | 8 | 13 | 20 | 55 | 1954 |
| yield_compare_16a_1 | 1 | 1 | 1 | 2 | 3 | 3 | 1506 |
| yield_compare_16a_1.iqr | 1 | 1 | 1 | 2 | 3 | 3 | 1506 |
| yield_compare_16a_2 | 1 | 1 | 2 | 2 | 2 | 3 | 1355 |
| yield_compare_16a_2.iqr | 1 | 1 | 2 | 2 | 2 | 3 | 1355 |
| yield_compare_16b_1 | 1 | 1 | 1 | 2 | 2 | 3 | 358 |
| yield_compare_16b_1.iqr | 1 | 1 | 1 | 2 | 2 | 3 | 358 |
| yield_compare_16b_2 | 1 | 1 | 1 | 2 | 2 | 3 | 1734 |
| yield_compare_16b_2.iqr | 1 | 1 | 1 | 2 | 2 | 3 | 1734 |
| lat | -3 | -2 | -2 | -2 | -2 | -2 | 497 |
| lat.iqr | -3 | -2 | -2 | -2 | -2 | -2 | 497 |
| lon | 29 | 29 | 30 | 30 | 30 | 31 | 497 |
| lon.iqr | 29 | 29 | 30 | 30 | 30 | 31 | 497 |
| alt | -108 | 1513 | 1673 | 1668 | 1887 | 2668 | 497 |
| alt.iqr | 957 | 1541 | 1680 | 1728 | 1887 | 2430 | 497 |
| precision | 5 | 10 | 15 | 19 | 15 | 4181 | 497 |
| precision.iqr | 5 | 10 | 15 | 13 | 15 | 20 | 497 |
| female | 0 | 0 | 1 | 1 | 1 | 1 | NA |
| female.iqr | 0 | 0 | 1 | 1 | 1 | 1 | NA |
# http://rforpublichealth.blogspot.com/2014/02/ggplot2-cheatsheet-for-visualizing.html
for(i in 1:length(numVarsNotAdmin)){
base <- ggplot(r, aes(x=r[,numVarsNotAdmin[i]])) + labs(x = numVarsNotAdmin[i])
temp1 <- base + geom_density()
temp2 <- base + geom_histogram()
#temp2 <- boxplot(r[,numVars[i]],main=paste0("Variable: ", numVars[i]))
multiplot(temp1, temp2, cols = 2)
}
Here is where I will clean soil values before merging them in.
First merge the soil data with the identifiers, where we should get full matches. Then merge the soil data to the survey data.
Identifiers <- Identifiers %>% rename(
sample_id = `Sample ID`,
SSN = `Lab ssn`
) %>% mutate(
sample_id = gsub(" ", "", tolower(sample_id))
)
table(Identifiers$SSN %in% soil$SSN) # full matches
TRUE
2426
soil <- left_join(soil, Identifiers[, c("SSN", "sample_id")], by="SSN")
We have some surveys that don’t have soil data. The soil sample IDs in the Identifiers data are a bit messy, so I clean both sides (in the mutate above and the line below) by removing spaces and making everything lower case.
r$sample_id <- tolower(r$sample_id)
table(r$sample_id %in% soil$sample_id)
FALSE TRUE
28 2366
r$sample_id[!r$sample_id %in% soil$sample_id]
[1] "1062c" "1198c" "1212" "1228" "1242" "1380c" "1384c" "1626c" "204" "2042c" "2175" "2415"
[13] "2418" "2418c" "2426" "2426c" "2534" "2561c" "2636c" "2671c" "2696" "2741" "2819" "2979"
[25] "596c" "65c" "66c" "931"
write.csv(r$sample_id[!r$sample_id %in% soil$sample_id], "surveysWoSoil.csv", row.names = F)
And some soil sample_id values don’t have a survey:
soil$sample_id[!soil$sample_id %in% r$sample_id]
[1] "137c" "569c" "902" "902c" "903" "903c" "904" "904c" "909" "909c" "912" "912c"
[13] "931c" "946" "946c" "947" "947c" "953" "953c" "954" "954c" "962" "962c" "964"
[25] "966c" "967" "968c" "969c" "970" "970c" "971" "971c" "973" "975" "975c" "1061c"
[37] "1062" "1096" "1096c" "1102" "1102c" "1103" "1103c" "1105" "1105c" "1159" "1159c" "1162c"
[49] "1203" "1359" "1372" "1432c" "1437" "1501" "1503c" "1538" "2215" "2204" "2350c" "2355"
[61] "2368" "2625c" "956c" "2685c" "2819c" "2634" "2850c" "1189c"
write.csv(soil$sample_id[!soil$sample_id %in% r$sample_id], "soilsWoSurvey.csv", row.names = F)
dim(r)
[1] 2394 93
r <- left_join(r, soil, by="sample_id")
dim(r) # why is it one row longer after the left_join?
[1] 2395 115
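The extra row almost certainly comes from a duplicated sample_id on the soil side: a left join repeats a left-hand row once per matching right-hand row. A toy illustration with invented values:

```r
# A duplicated key on the right-hand side of a left join adds rows.
left <- data.frame(sample_id = c("10", "12"), stringsAsFactors = FALSE)
right <- data.frame(sample_id = c("10", "12", "12"),
                    pH = c(5.1, 5.6, 6.0), stringsAsFactors = FALSE)
merged <- merge(left, right, by = "sample_id", all.x = TRUE)
nrow(merged)                                  # 3: one more than nrow(left)
right$sample_id[duplicated(right$sample_id)]  # "12" is the culprit key
```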
ggplot(r, aes(x=Calcium, y=Magnesium)) + geom_point() +
stat_smooth(method="loess") +
labs(x = "Calcium (m3)", y= "Magnesium (m3)", title="Calcium and Magnesium relationship")
ggplot(r, aes(x=pH, y=Calcium)) + geom_point() +
stat_smooth(method="loess") +
labs(x = "pH", y="Calcium (m3)", title = "pH and Calcium relationship")
ggplot(r, aes(x=pH, y=Magnesium)) + geom_point() +
stat_smooth(method="loess") +
labs(x = "pH", y="Magnesium (m3)", title = "pH and Magnesium relationship")
ggplot(r, aes(x=pH, y=X.Exchangeable.Acidity)) + geom_point() +
stat_smooth(method="loess") +
labs(x = "pH", y="Exchangeable Aluminum", title = "pH and Aluminum relationship")
ggplot(r, aes(x=X.Organic.Carbon, y=X.Total.Nitrogen)) + geom_point() +
stat_smooth(method="loess") +
labs(x = "Total Carbon", y="Total Nitrogen", title = "Carbon and Nitrogen relationship")
soilVars <- names(r)[which(names(r)=="pH"):which(names(r)=="X.Total.Nitrogen")]
Please note: These are raw comparisons and thus should not be taken as initial findings for how T and C farmers compare. Farmers will be matched to ensure a proper comparison.
for(i in 1:length(soilVars)){
p1 <- ggplot(data=r, aes(x=as.factor(d_client_16b), y=r[,soilVars[i]])) +
geom_boxplot() +
labs(x="Tubura Farmer", y=soilVars[i])
p2 <- ggplot(data=r, aes(x=r[,soilVars[i]])) +
geom_density() +
labs(x=soilVars[i])
multiplot(p1, p2, cols=2)
}
I’m seeing duplicated farmers in the data while trying to reshape r from wide to long. Let’s check them out here and see if we can figure out which observation is right.
length(r$sample_id)==length(unique(r$sample_id))
[1] FALSE
dups <- r$sample_id[duplicated(r$sample_id)]
dupIndex <- which(duplicated(r$sample_id))
#dupDat <- r[r$sample_id %in% dups,]
#head(r[r$sample_id==dups[1],])
#head(r[r$sample_id==dups[2],])
Let’s solve the unique id issue by looking at identifying information in the baseline data.
roundId <- r %>%
dplyr::select(district, cell_field, village, sample_id, farmer_list) %>%
filter(r$sample_id %in% dups)
#d
load("rawBaselineWithIdentifers.Rdata")
baseId <- d %>%
dplyr::select(district, selected_cell, umudugudu, sample_id, farmer_name ) %>%
filter(d$sample_id %in% dups)
#baseId
#roundId
Correct the duplicates I can and drop the others for now. Flag the duplicated ones and save them to share with Nathaniel.
TODO(mattlowes) - share any remaining duplicates with Nathaniel and see if he has a solution. Also see if he can understand why this might have happened and if they should actually have a different sample id.
r <- r %>% mutate(
dup = ifelse(
sample_id == "12" & cell_field == "MUNANIRA" |
sample_id == "137" & village == "Rusuma" |
sample_id == "1503" & farmer_list=="NAKAGIZE Val\\xc3\\xa9rie" |
#sample_id == "2044C" & # same!
sample_id == "2278" & cell_field=="Nkira A" | # check this as maybe this was the only thing wrong?
#sample_id == "2299" & # same!
sample_id == "2610" & village=="agakiri" #| #agakiri is close to gakiri in spelling. Is this just a typo?
#sample_id == "2612" & # same names!
#sample_id == "2612C" # same names!
, 1, 0)
) %>% filter(
dup!=1
) %>% dplyr::select(-dup)
# run this code again from above to get updated duplicates list
#length(r$sample_id)==length(unique(r$sample_id))
dups <- r$sample_id[duplicated(r$sample_id)]
dupIndex <- which(duplicated(r$sample_id))
# for the time being drop the observations that are duplicates
r <- r[!r$sample_id %in% dups,]
This should include the baseline variables as well.
Let’s first check the baseline data to see what variables we made there so I can make the same ones from the round 1 data. Some variables are baseline-only, like those asking about historical practices. Others vary by season; those are the variables we ultimately want to reshape into a long dataset by season, to analyze changes over time in practices and soil management. I think this will result in a dataset with one row per farmer per season. Some variables may not fit nicely into this, but we can deal with those. Variables that don’t change over time will show as unimportant in our model, but they are important for matching farmers.
There are a lot of variables to line up. Some already have the same name, but how best to combine the ones with different names? I’m going to write a function that takes a variable name from b and a variable name from r that belong together, updates the r variable name, and uses that info to rbind the data into a long dataset.
# names(b)
# names(r)
# check the names that already match
baselineFound <- names(b)[names(b) %in% names(r)] # not many variable names are aligned
Update variable names so that any variable with 16a or 16b has the a or b season designation at the end. That way I can replicate the gather() and spread() operations for reorganizing the data by season and by plot, and the variable names retain their first- or second-application designation and stay distinguishable.
TODO(mattlowes) - rename the variables according to that convention to reshape the r data. Keep the baseline data in mind as we’ll want to do the same thing with the baseline data to make them match.
r <- r %>% rename(
which_crop_1_16a = which_crop_16a_1,
which_maize_seed_1_16a = which_maize_seed_16a_1,
which_crop_2_16a = which_crop_16a_2,
which_maize_seed_2_16a = which_maize_seed_16a_2,
kg_seed_veg_1_16a = kg_seed_veg_16a_1,
kg_seed_1_16a = kg_seed_16a_1,
kg_seed_2_16a = kg_seed_16a_2,
kg_yield_1_16a = kg_yield_16a_1,
kg_yield_2_16a = kg_yield_16a_2,
yield_compare_1_16a = yield_compare_16a_1,
yield_compare_2_16a = yield_compare_16a_2,
which_crop_1_16b = which_crop_16b_1,
which_maize_seed_1_16b = which_maize_seed_16b_1,
which_crop_2_16b = which_crop_16b_2,
which_maize_seed_2_16b = which_maize_seed_16b_2,
#kg_seed_veg_1_16a = kg_seed_veg_16a_1,
kg_seed_1_16b = kg_seed_16b_1,
kg_seed_2_16b = kg_seed_16b_2,
kg_yield_1_16b = kg_yield_16b_1,
kg_yield_2_16b = kg_yield_16b_2,
yield_compare_1_16b = yield_compare_16b_1,
yield_compare_2_16b = yield_compare_16b_2
)
aSeason <- names(r)[grep("(1.a)", names(r))]
bSeason <- names(r)[grep("(1.b)", names(r))]
seasonalVars <- c(aSeason, bSeason, "sample_id")
farmerVars <- c(names(r)[!names(r) %in% seasonalVars], "sample_id")
# example data
# df <- data.frame(
# id = 1:10,
# time = as.Date('2009-01-01') + 0:9,
# Q3.2.1. = rnorm(10, 0, 1),
# Q3.2.2. = rnorm(10, 0, 1),
# Q3.2.3. = rnorm(10, 0, 1),
# Q3.3.1. = rnorm(10, 0, 1),
# Q3.3.2. = rnorm(10, 0, 1),
# Q3.3.3. = rnorm(10, 0, 1)
# )
#
# df %>%
# gather(key, value, -id, -time) %>%
# extract(key, c("question", "loop_number"), "(Q.\\..)\\.(.)") %>%
# spread(question, value)
source("../oaflib/misc.R")
# aDat <- r[,names(r) %in% aSeason] # works for this too!
# aDat <- aDat[,grep("16a_1", names(aDat))] # works for this
aDat <- r[,names(r) %in% seasonalVars] # works for this!
#http://stackoverflow.com/questions/25925556/gather-multiple-sets-of-columns
seasonalDat <- aDat %>%
gather(key, value, -sample_id) %>%
tidyr::extract(key, c("variable", "season"), "(^.*\\_1.)(.)") %>%
mutate(season = paste0("16", season)) %>%
spread(variable, value)
names(seasonalDat) <- gsub("_16", "", names(seasonalDat))
TODO(mattlowes) - confirm that the tidyr process worked as expected, as there are numerous missing values. These seem to appear where a variable had only one version, _16, rather than both a _16a and a _16b. Check how this handles variables with _17 instead of _16.
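One quick sanity check on the reshape (the toy values below are invented): a correct farmer-by-season dataset should put each sample_id in exactly one row per season.

```r
# Each sample_id x season cell should be exactly 1 after the reshape.
long <- data.frame(sample_id = rep(c("1", "2"), each = 2),
                   season = rep(c("16a", "16b"), times = 2),
                   kg_seed_1 = c(2, 4, 3, 5),
                   stringsAsFactors = FALSE)
tab <- table(long$sample_id, long$season)
all(tab == 1)  # TRUE when there is one row per farmer-season
```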
rs <- left_join(seasonalDat, r[,c(names(r)[!names(r) %in% seasonalVars],"sample_id")], by="sample_id")
rs$dim <- rs$field_length * rs$field_width # field area in square meters
rs$are <- rs$dim/100 # 1 are = 100 square meters
inputVars <- names(rs)[grep("fert_|quality_compost|type_compost|which_crop|which_maize", names(rs))]
rs[,inputVars] <- sapply(rs[, inputVars], tolower)
# input quantities
rs$fert_kg_urea1 <- ifelse(rs$fert_type1=="urea", rs$fert_kg1, NA)
rs$fert_kg_urea2 <- ifelse(rs$fert_type2=="urea", rs$fert_kg2, NA)
rs$fert_total_urea <- apply(rs[, grep("(urea.)", names(rs))], 1, function(x){
sum(as.numeric(x), na.rm=T)})
rs$fert_kg_dap1 <- ifelse(rs$fert_type1=="dap", rs$fert_kg1, NA)
rs$fert_kg_dap2 <- ifelse(rs$fert_type2=="dap", rs$fert_kg2, NA)
rs$fert_total_dap <- apply(rs[, grep("(dap.)", names(rs))], 1, function(x){
sum(as.numeric(x), na.rm=T)})
rs$fert_kg_17npk1 <- ifelse(rs$fert_type1=="npk-17", rs$fert_kg1, NA)
rs$fert_kg_17npk2 <- ifelse(rs$fert_type2=="npk-17", rs$fert_kg2, NA)
rs$fert_total_17npk <- apply(rs[, grep("(17npk.)", names(rs))], 1, function(x){
sum(as.numeric(x), na.rm=T)})
rs$fert_kg_22npk1 <- ifelse(rs$fert_type1=="npk-22", rs$fert_kg1, NA)
rs$fert_kg_22npk2 <- ifelse(rs$fert_type2=="npk-22", rs$fert_kg2, NA)
rs$fert_total_22npk <- apply(rs[, grep("(22npk.)", names(rs))], 1, function(x){
sum(as.numeric(x), na.rm=T)})
rs$fert_kg_2555npk1 <- ifelse(rs$fert_type1=="npk2555", rs$fert_kg1, NA)
rs$fert_kg_2555npk2 <- ifelse(rs$fert_type2=="npk2555", rs$fert_kg2, NA)
rs$fert_total_2555npk <- apply(rs[, grep("(2555npk.)", names(rs))], 1, function(x){
sum(as.numeric(x), na.rm=T)})
#lime
rs$lime_outside <- ifelse(rs$d_lime=="lime_outside", rs$kg_lime, NA)
rs$lime_tubura <- ifelse(rs$d_lime=="lime_tubura", rs$kg_lime, NA)
rs$lime_both <- ifelse(rs$d_lime=="both_tubura_non_tubura", rs$kg_lime, NA)
inputVars <- names(rs)[grep("field_length|field_width|dim|fert_kg_|fert_total_|lime_", names(rs))]
rs[,inputVars] <-sapply(rs[,inputVars], as.numeric)
#urea
rs$fert_kgare_urea1 <- rs$fert_kg_urea1/rs$are
rs$fert_kgare_urea2 <- rs$fert_kg_urea2/rs$are
rs$fert_kgare_urea_total <- rs$fert_total_urea/rs$are
#dap
rs$fert_kgare_dap1 <- rs$fert_kg_dap1/rs$are
rs$fert_kgare_dap2 <- rs$fert_kg_dap2/rs$are
rs$fert_kgare_dap_total <- rs$fert_total_dap/rs$are
#npk17
rs$fert_kgare_17npk1 <- rs$fert_kg_17npk1/rs$are
rs$fert_kgare_17npk2 <- rs$fert_kg_17npk2/rs$are
rs$fert_kgare_17npk_total <- rs$fert_total_17npk/rs$are
#npk22
rs$fert_kgare_22npk1 <- rs$fert_kg_22npk1/rs$are
rs$fert_kgare_22npk2 <- rs$fert_kg_22npk2/rs$are
rs$fert_kgare_22npk_total <- rs$fert_total_22npk/rs$are
#2555 npk
rs$fert_kgare_2555npk1 <- rs$fert_kg_2555npk1/rs$are
rs$fert_kgare_2555npk2 <- rs$fert_kg_2555npk2/rs$are
rs$fert_kgare_2555npk_total <- rs$fert_total_2555npk/rs$are
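The five fertilizer blocks above all follow one template (split kg by type, then total), so the logic could be collapsed into a loop. A self-contained sketch on a toy data frame whose columns mirror the real fert_type1/2 and fert_kg1/2 names:

```r
# Toy stand-in for rs; same column naming convention as the real data
toy <- data.frame(
  fert_type1 = c("urea", "dap"),
  fert_type2 = c("dap", "urea"),
  fert_kg1   = c("10", "4"),
  fert_kg2   = c("5", "2"),
  stringsAsFactors = FALSE
)
# named vector: short name used in new columns -> label used in fert_type*
fertTypes <- c(urea = "urea", dap = "dap")
for (nm in names(fertTypes)) {
  lbl <- fertTypes[[nm]]
  toy[[paste0("fert_kg_", nm, "1")]] <-
    ifelse(toy$fert_type1 == lbl, as.numeric(toy$fert_kg1), NA)
  toy[[paste0("fert_kg_", nm, "2")]] <-
    ifelse(toy$fert_type2 == lbl, as.numeric(toy$fert_kg2), NA)
  toy[[paste0("fert_total_", nm)]] <-
    rowSums(toy[, paste0("fert_kg_", nm, 1:2)], na.rm = TRUE)
}
# toy$fert_total_urea is c(10, 2); toy$fert_total_dap is c(5, 4)
```

Extending fertTypes with the three NPK entries would reproduce all five blocks in one pass.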
rs$season_16a <- ifelse(grepl("16a", rs$n_tubura_season), 1, 0)
rs$season_16b <- ifelse(grepl("16b", rs$n_tubura_season), 1, 0)
rs$season_17a <- ifelse(grepl("17a", rs$n_tubura_season), 1, 0)
rs$notClient3Seasons <- ifelse(grepl("not_a_client", rs$n_tubura_season), 1, 0)
fieldInputVars <- names(rs)[grep("field_length|field_width|dim|fert_kgare_", names(rs))]
for(i in 1:length(fieldInputVars)){
base <- ggplot(rs, aes(x=rs[,fieldInputVars[i]])) + labs(x = fieldInputVars[i], title=fieldInputVars[i])
temp1 <- base + geom_density()
temp2 <- base + geom_histogram()
#temp2 <- boxplot(r[,numVars[i]],main=paste0("Variable: ", numVars[i]))
multiplot(temp1, temp2, cols = 2)
}
Check field dimensions:
ggplot(rs, aes(x=field_width, y=field_length)) +
geom_point() +
labs(title= "Field dimensions", x = "Width (m)", y= "Length (m)")
library(dismo)
if (!exists("rwanda")) {
  # Only need to geocode once per session
  rwanda <- try(geocode("Rwanda"))
  # If the internet fails, fall back to approximate coordinates
  # (rough country centroid) so setView() below still works
  if (inherits(rwanda, "try-error")) {
    rwanda <- data.frame(longitude = 29.87, latitude = -1.94)
  }
}
See here for more on using markerClusterOptions in leaflet.
In the map below, the larger green circles are Tubura farmers and the smaller blue circles are control farmers. The number of observations will appear larger on the map because it’s plot level instead of farmer level.
e <- rs[!is.na(rs$lon),]
ss <- SpatialPointsDataFrame(coords = e[, c("lon", "lat")], data=e)
pal <- colorNumeric(c("navy", "green"), domain=unique(ss$client))
map <- leaflet() %>% addTiles() %>%
setView(lng=rwanda$longitude, lat=rwanda$latitude, zoom=8) %>%
addCircleMarkers(lng=ss$lon, lat=ss$lat,
radius= ifelse(ss$client==1, 10,6),
color = pal(ss$client),
clusterOptions = markerClusterOptions(disableClusteringAtZoom=13, spiderfyOnMaxZoom=FALSE))
map
Here are the key pieces of feedback for the next survey round:
The matchRounds function updates variable names across rounds and reports the index and new name of the variables. I can then take the first part of the list for dat1 and the second part for dat2.
Or just change baseline variable names manually. What’s the best way to do this? First reshape the baseline variables to be plot level as well with a season indicator.
TODO(matt.lowes) Confirm that this is necessary. If the baseline data only includes the previous season and the history then the reshape may not be necessary. All subsequent surveys asked about two seasons, the intervening season and the relevant season. Get your head around the baseline data again and act.
# b <- b %>% rename(
# inputuse_priord_fertilizer_15b = inputuse_15b_priord_fertilizer,
# inputuse_priorculture_15b_1 = inputuse_15b_priorculture_15b_1,
# inputuse_priord_intercrop_15b = inputuse_15b_priord_intercrop_15b,
# inputuse_priorculture_in_15b = inputuse_15b_priorculture_15b_in,
# crop1_seety_15b = crop1_15b_seedty,
# #v58
# crop1_yield_15b = crop1_15b_yield,
# crop1_yield__15b = crop1_15b_yield_,
# crop2_seedty_15b = crop2_15b_seedty,
# #63
# crop2_seedkg_15b = crop2_15b_seedkg,
# crop2_yield_16b = crop2_15b_yield,
# crop2_yield__15b = crop2_15b_yield_,
# field_fert_t_15b = field_15b_fert_t,
# #v69
# field_compost_qu_15b = field_compost_qu
# )
I think all that needs to be done is to add a season variable and rename the baseline variables to drop the _15b suffix.
write.csv(names(b), "baselineVars.csv", row.names = T)
write.csv(names(rs), "round1Vars.csv", row.names = T)
names(b) <- gsub("_15b", "", names(b))
b$season <- "15b"
b <- b %>% rename(
crop1_local = v58,
crop2_local = v63,
field_fert_t_1 = field_fert_t,
field_fert_t_2 = v69
)
# check what's already the same
matchNames <- names(rs)[names(rs) %in% names(b)]
# matchNames
TODO - now match all the variable names that need to be matched for the data to be appended. Ugh.
matchRounds <- function(dat1, dat2, var1, var2, new=NULL, choice="first"){
if (choice=="first"){
var2new = var1
#names(dat2)[names(dat2)==var2] <- var2new
return(list(
list(var1, grep(var1, names(dat1))),
list(var2new, grep(var2, names(dat2)))
))
} else if (choice=="second") {
var1new = var2
#names(dat1)[names(dat1)==var1] <- var1new
return(list(
list(var1new, grep(var1, names(dat1))),
list(var2, grep(var2, names(dat2)))
))
} else{
var1new = var2new = new
#names(dat2)[names(dat2)==var2] <- var2new
#names(dat1)[names(dat1)==var1] <- var1new
return(list(
list(var1new, grep(var1, names(dat1))),
list(var2new, grep(var2, names(dat2)))
))
}
}
dataSources <- c("b", "r")
namesToUpdate <- list(
c("demographicdate", "date", "first"),
c("sample", "d_sample", "second")
)
# example
dat1=b
dat2=r
var1 = "field_dim1"
var2 = "field_length"
choice="first"
test <- matchRounds(b, r, "field_dim1", "field_length", choice="first")
test2 <- matchRounds(b, r, "field_dim2", "field_width", choice="first")
test <- lapply(namesToUpdate, function(x){
  # each entry of namesToUpdate is c(var1, var2, choice)
  val = matchRounds(b, r, x[1], x[2], choice=x[3])
  return(val)
})
Analysis TODO:

- clean round 1
- feature creation
- matching
- following previous template
For next week:

- data are together
- talk with Maya about matching longitudinally
- soil graphs
Same as the baseline analysis but with two seasons of data
For attributes, baseline attribute and round 1 value >> what’s the trend?
How do soil attributes predict yields (climbing beans) >> can we understand yield as functions of carbon, pH, etc. Are the curves as we might expect?
The variable names from Commcare are in Kinyarwanda and a bit of a mess. I’m going to try to use the names from the Commcare form export. Or is there a way to get this information from Commcare? Surely there must be.
bean <- getFormData("oafrwanda", "M&E", "16B ALL Isarura (Harvest)", forceUpdate = forceUpdateAll)
[1] "found 736b25426bb4f9320a07d9c42b738ea"
write.csv(bean, file="rawCcYpData.csv", row.names=F)
yieldNames <- read.table(unz("2016B Harvest2017-06-08.zip", "Forms.csv"), nrows=10, header=T, quote="\"", sep=",") # only first 10 rows
# print variable names together
write.csv(data.frame(names(bean)[1:100], names(yieldNames)[1:100]), file="matchYieldNames.csv")
# get names from cc
# appName <- "M&E"
# formName <- "16B ALL Isarura (Harvest)"
# moduleIdx=NA
appData <- getAppStructure("oafrwanda")
enNames <- getFormFromApp(appData, "M&E", "16B ALL Isarura (Harvest)")$values
# leads to duplicates
onlyVarName <- strsplit(enNames, "/", fixed=F)
newNames <- do.call(rbind, lapply(onlyVarName, function(x){
return(x[[length(x)]])
}))
names(bean)[10:length(names(bean))] <- newNames
#names(bean)[duplicated(names(bean))]
# update intercrop names so that they're unique >> manual cleaning
names(bean)[61:70] <- paste("plants_box1", names(bean)[61:70], sep="_")
names(bean)[82:91] <- paste("plants_box2", names(bean)[82:91], sep="_")
names(bean)[170] <- paste0("climbing_beans_", names(bean)[170])
names(bean)[177] <- paste0("bush_beans_", names(bean)[177])
names(bean)[171] <- paste0("climbing_beans_", names(bean)[171])
names(bean)[178] <- paste0("bush_beans_", names(bean)[178])
names(bean)[173] <- paste0("climbing_beans_", names(bean)[173])
names(bean)[180] <- paste0("bush_beans_", names(bean)[180])
names(bean)[174] <- paste0("climbing_beans_", names(bean)[174])
names(bean)[181] <- paste0("bush_beans_", names(bean)[181])
names(bean)[211] <- paste0("bush_beans", names(bean)[211])
names(bean)[221] <- paste0("maize_", names(bean)[221])
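As a side note, the strsplit/do.call step used above to pull the last path segment can be cross-checked against base R's basename(), since the exported names are slash-delimited (the example paths here are made up):

```r
# Two ways to keep only the final segment of slash-delimited names
paths <- c("form/group/soil_code", "form/sample_id")
viaSplit <- vapply(strsplit(paths, "/", fixed = TRUE),
                   function(p) p[length(p)], character(1))
viaBase <- basename(paths)
identical(viaSplit, viaBase)  # TRUE
```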
The best version of the English names doesn't come from the data labels; it comes from another portion of the output. I've extracted it here, but a key piece of feedback for Nathaniel will be to make certain that going forward, variable labels are in the right places.
It’s probably safe to assume that if there isn’t a soil code the data can be dropped. It’s not clear how to match the yield data to the soil data. There might be a way to use the client id from the SHS data but I also don’t know if that maps to the M&E data. I could try it if Nathaniel doesn’t have a suggestion.
#names(bean)[grep("soil",names(bean))]
#names(bean)[grep("id",names(bean))]
#table(bean$soil_code, useNA = 'ifany')
pairedSoilDir <- normalizePath(file.path("..", "..", "OAF Soil Lab Folder", "Projects", "rw_shs_16b_paired_climbing", "4_predicted", "other_summaries"))
pairedSoil <- read.csv(file=paste(pairedSoilDir, "combined-predictions-including-bad-ones.csv", sep = "/"))
pSoilIdDir <- normalizePath(file.path("..", "..", "OAF Soil Lab Folder", "Projects", "rw_shs_16b_paired_climbing", "5_merged"))
pSoilIds <- read.csv(file=paste(pSoilIdDir, "database.csv", sep = "/"))
Helpful links: mutate_each and var names to lower
psi <- pSoilIds %>%
setNames(tolower(names(.))) %>%
mutate_each(funs(tolower), district, cell) %>%
rename(ssn = lab.ssn) %>%
mutate(
idDups = duplicated(id) | duplicated(.[nrow(.):1, "id"])[nrow(.):1],
ssnDups = duplicated(ssn) | duplicated(.[nrow(.):1, "ssn"])[nrow(.):1]
)
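The reversed-index idiom above (flagging every copy of a duplicated value, not just the later occurrences) can also be written with duplicated()'s fromLast argument, which reads more clearly; a small check:

```r
# Flag all copies of any value that occurs more than once
x <- c(1, 2, 2, 3, 1)
both <- duplicated(x) | duplicated(x, fromLast = TRUE)
# both is TRUE, TRUE, TRUE, FALSE, TRUE
```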
pairedSoil <- pairedSoil %>%
setNames(tolower(names(.)))
#table(psi$ssn %in% pairedSoil$ssn) # FALSE TRUE 41 703
#table(pairedSoil$ssn %in% psi$ssn) # FALSE TRUE 27 703
pairedSoil <- left_join(pairedSoil, psi, by="ssn") # keeps all paired soil values, no duplicated ids
And now check how many soil ids are duplicated in the bean data. Is there any hope of untangling which ones are supposed to be which based on the info provided in the soil data?
beanCheck <- bean %>%
filter(!is.na(soil_code)) %>%
mutate(
idDups = duplicated(soil_code) | duplicated(.[nrow(.):1, "soil_code"])[nrow(.):1]
)
beanCheck %>%
filter(idDups==TRUE) %>%
arrange(soil_code) %>%
dplyr::select(district, cell, soil_code)
And let’s compare this to the ids in the soil data to see if we can find matches. If I can, I’ll need to make a new unique id to match them.
#vector of duplicated ids in the bean data
idComps <- unique(beanCheck$soil_code[beanCheck$idDups==TRUE])
pairedSoil %>%
filter(id %in% idComps) %>%
arrange(id) %>%
dplyr::select(district, cell, id)
Visually it doesn’t seem that there are easy matches to be made. We obviously don’t have any -88s or 0s in the id data.
- 24764: Gitega g doesn't exist.
- 44337: there are two murambi and we have no further distinguishing info.
- 183004: the name is entirely different.
- 1326301: kibyagira seems to be the best match!
- 9050401: the names are the same.
- 14160102: the names are the same.

Fix the one duplicate we can, drop the others, and merge the yield data with the soil data. TODO - still waiting on Nathaniel for guidance on how to calculate climbing bean yield. I can take a look at this and see if I can guess.
TODO - follow up with Nathaniel about the soil ids not matching.
bean <- bean[-which(bean$soil_code==1326301 & bean$cell=="Gahira"),]
py <- bean %>%
filter(!is.na(soil_code)) %>%
mutate(
idDups = duplicated(soil_code) | duplicated(.[nrow(.):1, "soil_code"])[nrow(.):1]
) %>%
filter(idDups==FALSE) %>%
rename(ns = id, # change the bean id to something else, nonsense
id = soil_code)
We lose 281 obs to duplicated or useless ids.
loss <- table(py$id %in% pairedSoil$id)[[1]]
#py$id[!py$id %in% pairedSoil$id]
#table(pairedSoil$id %in% py$id)
We then lose 203 to not having matches. We’re not getting good value for our money here.
toJoin <- names(pairedSoil)[c(2:22,25)]
py <- py %>%
inner_join(., pairedSoil[,toJoin], by="id")
I'm going to take a quick guess at how the kg/m2 and t/ha yield calculations were made so that I can set up the analyses I want. First I'm incorporating the changes Alex Villec made to the data in his .do file; see cleans_harvest_16b.do starting on line 85.
py$box_length1 <- ifelse(py$d_box_lenght1!=7 & py$d_box_lenght1!=3, 5, py$d_box_lenght1)
py$box_width1 <- ifelse(py$d_box_width1!=7 & py$d_box_width1!=3, 5, py$d_box_width1)
py$box_length2 <- ifelse(py$d_box_length2!=7 & py$d_box_length2!=3, 5, py$d_box_length2)
py$box_width2 <- ifelse(py$d_box_width2!=7 & py$d_box_width2!=3, 5, py$d_box_width2)
calculateYield <- function(bagA, bagB, lenA, lenB, widthA, widthB, df) {
#convert to numeric
df[,c(bagA, bagB, lenA, lenB, widthA, widthB)] <- sapply(df[,c(bagA, bagB, lenA, lenB, widthA, widthB)], function(x){
as.numeric(as.character(x))
})
# calculate box areas
df$boxAreaA <- df[,lenA] * df[,widthA]
df$boxAreaB <- df[,lenB] * df[,widthB]
df$yieldA <- df[,bagA] / df$boxAreaA
df$yieldB <- df[,bagB] / df$boxAreaB
df$yieldProbsA <- is.na(df$yieldA) | is.infinite(df$yieldA)
df$yieldProbsB <- is.na(df$yieldB) | is.infinite(df$yieldB)
df$yield <- (df[,bagA] + df[,bagB]) / (df$boxAreaA + df$boxAreaB)
df$yield[!df$yieldProbsA & df$yieldProbsB] <-
df$yieldA[!df$yieldProbsA & df$yieldProbsB]
df$yield[!df$yieldProbsB & df$yieldProbsA] <-
df$yieldB[!df$yieldProbsB & df$yieldProbsA]
return(df)
}
py <- calculateYield("box_kg1", "box_kg2", "box_length1", "box_length2", "box_width1", "box_width2", py)
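A quick sanity check of calculateYield on toy rows, including one where box B has zero area (infinite per-box yield), in which case the function should fall back to box A alone. The column names here are made up:

```r
chk <- data.frame(
  kgA = c(10, 10), kgB = c(5, 5),
  lenA = c(5, 5), lenB = c(5, 0),
  widA = c(5, 5), widB = c(5, 5)
)
chk <- calculateYield("kgA", "kgB", "lenA", "lenB", "widA", "widB", chk)
# Row 1: both boxes usable, yield = (10 + 5) / (25 + 25) = 0.3 kg/m2
# Row 2: box B area is 0, so yield falls back to box A: 10 / 25 = 0.4 kg/m2
```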
respVar <- c(names(py)[which(names(py)=="ph"): which(names(py)=="x.total.nitrogen")], "yield")
yr <- py[,names(py) %in% respVar]
yr$tha <- yr$yield * 10
soilVars <- names(yr)[which(names(yr)=="ph"):which(names(yr)=="x.total.nitrogen")]
soilYTab <- do.call(plyr::rbind.fill, lapply(soilVars, function(y){
#print(y)
res = iqr.check(yr, y)
#print(dim(res))
out = data.frame(var=rbind(y, paste(y, ".iqr", sep="")), res)
return(out)
}))
soilYTab[,2:length(soilYTab)] <- sapply(soilYTab[,2:length(soilYTab)], function(x){round(x,2)})
The outlier table summarizes the numeric variables with and without IQR outliers to show how the data changes based on this filter.
knitr::kable(soilYTab, row.names = F, digits = 3, format = 'html')
| var | Min. | X1st.Qu. | Median | Mean | X3rd.Qu. | Max. |
|---|---|---|---|---|---|---|
| ph | 4.44 | 5.19 | 5.49 | 5.55 | 5.83 | 9.06 |
| ph.iqr | 4.44 | 5.18 | 5.48 | 5.50 | 5.80 | 6.78 |
| x.ec..salts. | 32.26 | 51.37 | 58.24 | 62.07 | 69.06 | 115.30 |
| x.ec..salts..iqr | 32.26 | 51.00 | 57.89 | 60.17 | 67.32 | 95.29 |
| phosphorus | 5.85 | 12.85 | 18.31 | 21.23 | 25.29 | 97.86 |
| phosphorus.iqr | 5.85 | 12.55 | 17.48 | 19.09 | 24.01 | 43.40 |
| potassium | 94.21 | 142.20 | 157.70 | 164.70 | 182.70 | 273.40 |
| potassium.iqr | 94.21 | 141.50 | 156.70 | 162.70 | 180.70 | 243.10 |
| calcium | 135.60 | 406.80 | 596.60 | 778.70 | 895.30 | 6894.00 |
| calcium.iqr | 135.60 | 398.60 | 560.20 | 629.10 | 806.60 | 1623.00 |
| magnesium | 28.96 | 92.46 | 140.20 | 187.60 | 212.70 | 1799.00 |
| magnesium.iqr | 28.96 | 89.15 | 129.60 | 145.60 | 189.50 | 390.70 |
| manganese | 9.46 | 29.44 | 52.44 | 68.13 | 86.11 | 402.40 |
| manganese.iqr | 9.46 | 28.54 | 48.68 | 57.97 | 78.46 | 170.70 |
| sulphur | 2.00 | 11.62 | 15.63 | 16.33 | 20.43 | 43.34 |
| sulphur.iqr | 2.00 | 11.60 | 15.51 | 16.08 | 20.32 | 31.79 |
| boron | 0.07 | 0.15 | 0.21 | 0.23 | 0.27 | 0.83 |
| boron.iqr | 0.07 | 0.15 | 0.20 | 0.21 | 0.26 | 0.45 |
| zinc | 0.57 | 1.69 | 2.46 | 3.17 | 3.46 | 136.00 |
| zinc.iqr | 0.57 | 1.64 | 2.36 | 2.57 | 3.23 | 6.12 |
| x.c.e.c | 3.35 | 6.32 | 7.68 | 9.18 | 10.36 | 45.81 |
| x.c.e.c.iqr | 3.35 | 6.21 | 7.44 | 8.12 | 9.60 | 16.38 |
| x.exchangeable.aluminium | 0.00 | 0.05 | 0.10 | 0.23 | 0.25 | 5.06 |
| x.exchangeable.aluminium.iqr | 0.00 | 0.04 | 0.09 | 0.13 | 0.17 | 0.55 |
| x.exchangeable.acidity | 0.02 | 0.17 | 0.36 | 0.68 | 0.87 | 4.77 |
| x.exchangeable.acidity.iqr | 0.02 | 0.16 | 0.33 | 0.51 | 0.76 | 1.90 |
| copper | 0.96 | 1.97 | 2.38 | 2.48 | 2.85 | 5.67 |
| copper.iqr | 0.96 | 1.96 | 2.36 | 2.44 | 2.80 | 4.17 |
| aluminium | 543.10 | 844.20 | 995.20 | 1040.00 | 1230.00 | 1794.00 |
| aluminium.iqr | 543.10 | 844.20 | 995.20 | 1040.00 | 1230.00 | 1794.00 |
| x.sodium | 21.39 | 38.33 | 43.83 | 43.73 | 49.01 | 84.82 |
| x.sodium.iqr | 22.90 | 38.33 | 43.78 | 43.53 | 48.89 | 65.00 |
| iron | 124.00 | 187.50 | 206.40 | 215.30 | 234.60 | 415.20 |
| iron.iqr | 124.00 | 186.10 | 205.20 | 209.60 | 229.50 | 304.20 |
| x.phosphorus.sorption.index..psi. | 43.87 | 84.79 | 130.30 | 142.60 | 188.10 | 342.00 |
| x.phosphorus.sorption.index..psi..iqr | 43.87 | 84.79 | 130.30 | 142.60 | 188.10 | 342.00 |
| x.organic.carbon | 1.21 | 1.73 | 1.93 | 2.01 | 2.21 | 3.63 |
| x.organic.carbon.iqr | 1.21 | 1.71 | 1.90 | 1.96 | 2.17 | 2.91 |
| x.total.nitrogen | 0.10 | 0.13 | 0.14 | 0.15 | 0.17 | 0.26 |
| x.total.nitrogen.iqr | 0.10 | 0.13 | 0.14 | 0.14 | 0.16 | 0.21 |
Impose sensible constraints on soil variables
Ask Patrick and Step what those might be. I’ve removed IQR outliers but that’s probably a bit too crude.
mat <- sapply(yr[,soilVars], function(x){
q1 = summary(x)[[2]]
q3 = summary(x)[[5]]
iqr = q3-q1
check = ifelse(x < (q1 - (1.5*iqr)) | x > (q3 + (1.5*iqr)), NA,x)
return(check)
})
mat <- as.data.frame(mat)
names(mat) <-paste0(soilVars, ".iqr")
yr <- cbind(yr, mat)
soilIqr <- names(yr)[which(names(yr)=="ph.iqr"):which(names(yr)=="x.total.nitrogen.iqr")]
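The IQR filter above can be illustrated on a toy vector: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is set to NA, exactly as in the sapply over soilVars:

```r
# Same rule as the outlier filter above, on a vector with one clear outlier
v <- c(1, 2, 3, 4, 100)
q1 <- summary(v)[[2]]  # 1st quartile
q3 <- summary(v)[[5]]  # 3rd quartile
iqr <- q3 - q1
vClean <- ifelse(v < q1 - 1.5 * iqr | v > q3 + 1.5 * iqr, NA, v)
# 100 falls outside the fences and becomes NA; 1..4 survive
```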
Link to the diagPlot and the interpretation of linear diagnostics and guidance on GridExtra
diagPlot<-function(model){
p1 <- ggplot(model, aes(.fitted, .resid)) + geom_point()
p1 <- p1 + stat_smooth(method="loess") + geom_hline(yintercept=0, col="red", linetype="dashed")
p1 <- p1 + xlab("Fitted values")+ylab("Residuals")
p1 <- p1 + ggtitle("Residual vs Fitted Plot")+theme_bw()
#p2Mod <- fortify(model)
p2 <- ggplot() + geom_qq(data=model, aes(sample=.stdresid))
p2<-p2+geom_abline()
p2<-p2+ggtitle("Normal Q-Q")+theme_bw()
p3<-ggplot(model, aes(.fitted, sqrt(abs(.stdresid))))+geom_point(na.rm=TRUE)
p3<-p3+stat_smooth(method="loess", na.rm = TRUE)+xlab("Fitted Value")
p3<-p3+ylab(expression(sqrt("|Standardized residuals|")))
p3<-p3+ggtitle("Scale-Location")+theme_bw()
# p4<-ggplot(model, aes(seq_along(.cooksd), .cooksd))+geom_bar(stat="identity", position="identity")
# p4<-p4+xlab("Obs. Number")+ylab("Cook's distance")
# p4<-p4+ggtitle("Cook's distance")+theme_bw()
p5<-ggplot(model, aes(.hat, .stdresid))+geom_point(aes(size=.cooksd), na.rm=TRUE)
p5<-p5+stat_smooth(method="loess", na.rm=TRUE)
p5<-p5+xlab("Leverage")+ylab("Standardized Residuals")
p5<-p5+ggtitle("Residual vs Leverage Plot")
p5<-p5+scale_size_continuous("Cook's Distance", range=c(1,5))
p5<-p5+theme_bw()+theme(legend.position="bottom")
# p6<-ggplot(model, aes(.hat, .cooksd))+geom_point(na.rm=TRUE)+stat_smooth(method="loess", na.rm=TRUE)
# p6<-p6+xlab("Leverage hii")+ylab("Cook's Distance")
# p6<-p6+ggtitle("Cook's dist vs Leverage hii/(1-hii)")
# p6<-p6+geom_abline(slope=seq(0,3,0.5), color="gray", linetype="dashed")
# p6<-p6+theme_bw()
return(list(rvfPlot=p1,
qqPlot=p2,
sclLocPlot=p3,
#cdPlot=p4,
rvlevPlot=p5
#cvlPlot=p6
))
}
plm <- function(x) { # x is a model
require(gridExtra)
# generate raw tables of useful information
cp <- data.frame(coef(summary(x))) # coefficient and p-values
ci <- data.frame(confint(x)) # 95% confidence intervals
# strip out and format just what we need from cp into another table
names(cp) <- c("Coefficient", "Std.Error", "T", "P")
tab <- cp[, c("Coefficient", "P")]
tab$Coefficient <- signif(tab$Coefficient, digits = 2)
tab$P <- ifelse(tab$P < 0.001, paste("<0.001", "***"),
ifelse(tab$P < 0.01 & tab$P >= 0.001,
paste(signif(tab$P, digits = 2), "**"),
ifelse(tab$P < 0.05 & tab$P >= 0.01,
paste(signif(tab$P, digits = 2), "*"),
ifelse(tab$P < 0.1 & tab$P >= 0.05,
paste(signif(tab$P, digits = 2), "."),
round(tab$P, digits = 2)))))
# add prettified confidence intervals to tab
ci$X2.5.. <- signif(ci$X2.5.., digits = 2)
ci$X97.5.. <- signif(ci$X97.5.., digits = 2)
tab$CI <- paste(ci$X2.5.., ci$X97.5.., sep = " to ")
# rearrange and rename tab
tab <- tab[, c("Coefficient", "CI", "P")]
names(tab) <- c("Coefficient", "95% Confidence Interval", "P-Value")
# remove the district controls, which are always the last
tab = tab[!grepl("district", row.names(tab)), ]
tt = ttheme_default(colhead=list(fg_params = list(parse=TRUE)))
tabOut = tableGrob(tab, theme=tt,
rows=names(x$coefficients))
# make the plot and table
do.call(grid.arrange, diagPlot(x))
grid.arrange(tabOut)
# grid.arrange(
# list(do.call(grid.arrange,diagPlot(x)),
# tabOut),
# nrow=2,
# top="Model Diagnostics"
# )
#making the graphics go together
# output = grid.arrange(plotOut,
# tabOut,
# as.table=TRUE)
#return(output)
#return(do.call(cbind, tmp))
#return(tmp)
}
I’m not entirely certain how to best model yield as a function of soil properties. I’m going to run a handful of models and give some initial caveats.
Obligatory disclaimer: these summaries are intended to show whether the regression models are reliable. We already know the regressions cannot be interpreted directly. We should find more established linear yield models to give our approach a firmer foundation in the literature.
invisible(lapply(soilVars, function(x){
#print(paste0("Soil variable: ", x))
plm(lm(tha ~ yr[,x], data=yr))
}))
invisible(lapply(soilIqr, function(x){
#print(paste0("Soil variable: ", x))
plm(lm(tha ~ yr[,x], data=yr))
}))
Obligatory disclaimer: these curves are simplistic and can't be taken at face value. We only have soil chemistry data, and noisy soil chemistry data at that. We should follow more established yield models to understand the contribution of individual features toward yield response.
respCurve <- function(dat, yVar, xVar, yLab, xLab, gTitle){
ggplot(dat) +
stat_smooth(aes(x = dat[,xVar], y=dat[,yVar]), se=FALSE) +
theme_bw() +
labs(x = xLab, y = yLab, title=gTitle)
}
# response curves
for(i in 1:length(soilVars)){
print(
respCurve(yr, "tha", soilVars[i],"Yield (t/ha)", soilVars[i], gTitle = paste0("Basic response curve: ", soilVars[i]))
)
}
# response curves
for(i in 1:length(soilIqr)){
print(
respCurve(yr, "tha", soilIqr[i],"Yield (t/ha)", soilIqr[i], gTitle = paste0("Basic response curve: ", soilIqr[i]))
)
}